Hierarchical Clustering in Medical Document Collections: the BIC-Means Method
Abstract
Hierarchical clustering of text collections is a key problem in document management and retrieval. In partitional hierarchical clustering, which is more efficient than its agglomerative counterpart, the entire collection is split into clusters and the individual clusters are further split until a heuristically motivated termination criterion is met. In this paper, we define the BIC-means algorithm, which applies the Bayesian Information Criterion (BIC) as a domain-independent termination criterion for partitional hierarchical clustering. We evaluate the effectiveness of BIC-means in clustering and retrieval on medical document collections, and we propose a dynamic version of the BIC-means algorithm for adapting an existing clustering solution to document additions.
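The abstract does not give the exact BIC formula or splitting procedure, so the following is only a minimal sketch of the general idea: split a cluster in two, and keep the split only when the BIC of the two-cluster model exceeds that of the one-cluster model. It assumes dense document vectors, Euclidean distance, and a spherical-Gaussian BIC in the style commonly used with X-means. The names `bic_score`, `two_means`, `bic_means`, and the `min_size` parameter are hypothetical and do not come from the paper.

```python
import numpy as np


def bic_score(points, centers, labels):
    """BIC (higher is better) of a hard-assigned spherical-Gaussian model.

    Assumption: one shared variance across clusters; this is an
    illustrative choice, not the formula from the paper.
    """
    n, d = points.shape
    k = centers.shape[0]
    sse = np.sum((points - centers[labels]) ** 2)
    variance = max(sse / (d * (n - k)), 1e-12)  # floor avoids log(0)
    counts = np.bincount(labels, minlength=k)
    log_lik = (
        np.sum(np.log(counts[labels] / n))            # mixing weights
        - 0.5 * n * d * np.log(2.0 * np.pi * variance)
        - sse / (2.0 * variance)
    )
    n_params = (k - 1) + k * d + 1  # weights, centers, shared variance
    return log_lik - 0.5 * n_params * np.log(n)


def two_means(points, n_iter=20):
    """Plain 2-means with a deterministic farthest-point initialisation."""
    mean = points.mean(axis=0)
    first = points[np.argmax(np.sum((points - mean) ** 2, axis=1))]
    second = points[np.argmax(np.sum((points - first) ** 2, axis=1))]
    centers = np.stack([first, second]).astype(float)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    # Final assignment so labels agree with the returned centers.
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)


def bic_means(points, min_size=4):
    """Recursively bisect clusters while the split improves the BIC.

    Returns a list of index arrays, one per leaf cluster.
    """
    points = np.asarray(points, dtype=float)
    leaves, stack = [], [np.arange(len(points))]
    while stack:
        idx = stack.pop()
        subset = points[idx]
        if len(idx) < min_size:
            leaves.append(idx)
            continue
        centers, labels = two_means(subset)
        if labels.min() == labels.max():  # degenerate split, stop here
            leaves.append(idx)
            continue
        parent = bic_score(
            subset,
            subset.mean(axis=0, keepdims=True),
            np.zeros(len(idx), dtype=int),
        )
        children = bic_score(subset, centers, labels)
        if children > parent:
            stack.extend([idx[labels == 0], idx[labels == 1]])
        else:
            leaves.append(idx)
    return leaves
```

Calling `bic_means(vectors)` on an array of shape `(n_documents, n_features)` returns the leaf clusters as arrays of row indices. For real text collections one would more likely work with sparse TF-IDF vectors and cosine similarity; the dense Euclidean version here is chosen only to keep the sketch short.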
Similar resources
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from large amounts of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
Implementation of Hybrid Clustering Algorithm with Enhanced K-Means and Hierarchal Clustering
We propose a hybrid clustering method that combines the strengths of both partitioning and agglomerative clustering methods. Clustering algorithms that build meaningful hierarchies out of large document collections are ideal tools for their interactive visualization and exploration, as they provide data views that are consistent, predictable, and at different levels of granularit...
An Empirical Study of K-Means Initialization Methods for Document Clustering
Every day, vast amounts of documents, e-mails, and web pages are generated. To handle these data, automatic techniques such as document clustering are needed. The k-means method is a clustering technique widely used in practice because of its simplicity and empirical speed. In this paper, the basic k-means algorithm is augmented with two special initialization techniques that aim at impr...
Giving an Upperbound of the Number of Clusters and Relevant Words in Hierarchical Document Clustering Based on BIC
A new generative-model-based approach to automatic document clustering, using the BIC as the model selection criterion, is described. A new method based on a graphical model is proposed to give an upper bound on the numbers of clusters and relevant words. The results of an experiment using the NTCIR web data collection are briefly reported.
Exploiting parallelism to support scalable hierarchical clustering
A distributed memory parallel version of the group average Hierarchical Agglomerative Clustering algorithm is proposed to enable scaling the document clustering problem to large collections. Using standard message passing operations reduces interprocess communication while maintaining efficient load balancing. In a series of experiments using a subset of a standard TREC test collection, our par...
Journal title:
- JDIM
Volume 8
Publication date: 2010